DOCUMENT RESUME

ED 227 175                                              TM 830 204

AUTHOR        Choppin, Bruce H.
TITLE         Extracting More Information from Multiple Choice Tests: Analytic Techniques for the Answer-until-Correct Mode.
INSTITUTION   California Univ., Los Angeles. Center for the Study of Evaluation.
SPONS AGENCY  National Inst. of Education (ED), Washington, DC.
PUB DATE      Apr 83
NOTE          19p.; Paper presented at the Annual Meeting of the American Educational Research Association (67th, Montreal, Quebec, Canada, April 11-15, 1983).
PUB TYPE      Speeches/Conference Papers (150); Reports - Research/Technical (143)
EDRS PRICE    MF01/PC01 Plus Postage.
DESCRIPTORS   Computer Assisted Testing; Difficulty Level; *Guessing (Tests); Knowledge Level; *Latent Trait Theory; *Mathematical Models; *Measurement Techniques; *Multiple Choice Tests; Psychometrics; Test Items
IDENTIFIERS   *Answer Until Correct; Distractors (Tests); One Parameter Model; *Partial Knowledge (Tests); Rasch Model

ABSTRACT

In the answer-until-correct mode of multiple-choice testing, respondents are directed to continue choosing among the alternatives to each item until they find the correct response. There is no consensus as to how to convert the resulting pattern of responses into a measure because of two conflicting models of item response behavior. The first suggests that partial knowledge allows the subject to eliminate some distractors immediately, and then assumes essentially random guessing among the remainder. The second proposes that the first error made by the subject results from misinformation, but that guessing comes into play after that. The paper considers three latent trait measurement models from each of these perspectives. Each is an extension of the Rasch one-parameter logistic model. The first, which is most relevant to the partial knowledge viewpoint, is based on a count of the error choices before the correct response is identified. The second calibrates the difficulty of each step in each item. The third calibrates the difficulty of each distractor. It is argued that the second model provides the best context for distinguishing between the misinformation and partial knowledge approaches. (Author/PN)

EXTRACTING MORE INFORMATION FROM MULTIPLE CHOICE TESTS:
ANALYTIC TECHNIQUES FOR THE ANSWER-UNTIL-CORRECT MODE

1. Introduction


Though they are convenient to use and have some desirable psychometric properties, multiple choice tests have been widely attacked (Wood, 1977). Three specific criticisms that have been made against conventional multiple choice tests are:

1) That they face the testee with three or four times as many incorrect statements as correct ones and provide no feedback to help the student learn the correct answers.

2) That they encourage random guessing.

3) That they are inefficient in that little information is gained about the student from his response to a single item.

The "answer-until-correct" testing mode (Brown, 1965; Hanna, 1975) is designed to overcome these problems. In this mode the student is presented with instant feedback to a response. If the response is correct, the student is directed to continue to the next question, but if the response is incorrect he or she is asked to attempt the item again. This form of testing has the advantage of extracting significantly more information about a student's ability from a given number of items, and thus makes it easier to distinguish between different levels of partial knowledge or part mastery. It has also been suggested that this response mode may reduce the incidence of random guessing behavior among students, and it has the additional benefit that (most of the time) the final answer chosen by the student to an item is also the correct one. There is, a priori, reason to believe that this response, the one that receives positive reinforcement, is the one most likely to be remembered.

A number of research studies have focused on the characteristics and usefulness of answer-until-correct (AUC) testing. For example, Merwin (1959), Brown (1965) and Frary (1980) investigated various scoring procedures. None of the more complex alternatives they tried appeared to improve significantly on Brown's simple approach of reducing the total score by one point for every incorrect distractor selected. Hanna (1975) and Kane & Moloney (1978) investigated the implications of AUC responding for reliability and validity. Hanna suggested that the AUC procedure increased reliability but generally appeared to decrease validity (as measured by correlation with a substantive external criterion). The implication is that testwiseness may play a more significant role on AUC tests than on conventional tests. This relates back to Merwin's earlier paper, in which he concluded that if test constructors were to reap the potential advantages of the AUC procedure, then item distractors would have to be carefully designed so as to relate in a clear way to the criterion variable.

Much of the earlier work reported on this topic displayed considerable vagueness as to the presumed behavior of the student when taking a test.

A careful reading and analysis of the logic presented suggests that the writers were assuming the relevance of one or the other of two contrasting and incompatible models. The first, which may be called the partial knowledge model, assumes that the student may know enough about the subject matter with which the item is concerned to be able to eliminate one or more of the distractors with some certainty. He is then presumed to guess at random among those that remain. Complete mastery of the problem involves the certain elimination of all but one of the alternative responses so that the student chooses the correct answer without guessing.

The second model assumes that a student arrives at an incorrect response not through some guessing procedure, but through the application of misinformation. Under the answer-until-correct procedure, such a student, having applied his misinformation to obtain the wrong answer, is forced to choose again. The feedback that the first piece of misinformation is incorrect may provide important incidental learning. The next choice may be a random guess, or another response selected on the basis of misinformation.

Frary showed that the AUC procedure was effective in discriminating between students when they operated on the basis of partial information, but suggested that the scoring procedure could be improved for students operating under the misinformation model. Wilcox (1982) further considers the distinction between the partial knowledge and misinformation models and discusses appropriate rules for scoring tests when the latter operates. Unfortunately, it would appear that in practice many individuals use both strategies when taking tests, and it is difficult to tell, when looking at the pattern of results, on which items they were employing partial knowledge and on which misinformation. Questioning students following the administration of an AUC test could help to clarify this issue.

The answer-until-correct procedure has made comparatively little impact on the field of educational testing in the seventeen years since Brown's paper for two reasons:

(a) the lack of convenient and appropriate technology for providing instant feedback to the student, since clinical administration of tests is prohibitively expensive; and

(b) the absence of a sound theoretical base for turning the data into measures, for while Brown's system appears to work in practice, there is no model to substantiate it or check its validity.

On the first issue, there have been a number of recent developments. Answer-until-correct tests currently in use (on an experimental or regular basis) use one of three different feedback technologies. The first approach requires an answer sheet preprinted in invisible ink, so that when the student responds (using a special pen) a portion of the preprinted material becomes visible, and the student obtains the appropriate feedback. The second method involves having the student erase a shield printed over the top of the feedback information, again on a specially prepared answer sheet. Each of these approaches requires some special equipment for preparing the answer sheets, which have to be customized to fit a particular test. However, this equipment is now fairly generally available, and the answer sheets produced from it are not unduly expensive.

The third approach involves testing by computer. This method is potentially superior to the other methods because it allows the recording of the sequence in which particular responses are chosen. The first two methods described permit the inference that the correct response was chosen last, but do not easily allow the earlier incorrect responses to be ordered. Until very recently the computer was far too expensive to be considered seriously for routine use as a test administering device, but the rapid development of terminals, and in particular of inexpensive microprocessors, opens up new possibilities.

The computer is able not only to record the sequence in which distractors are selected, but also to accumulate other information (e.g., how long the delay was between each response), and continually update estimates of the student's level of performance and the measurement precision. It is also able to provide more or less detailed feedback, under the control of the test constructor, and to provide the feedback in an entirely standard fashion so that no inadvertent clues are presented. During the last year, a team at the Center for the Study of Evaluation has devoted considerable effort to developing an effective and efficient program for administering answer-until-correct tests using Apple microcomputer systems. We have designed this system so as to be useful to classroom teachers who currently have access to Apple or similar computers, and also to us in collecting answer-until-correct data for our psychometric research.

The rest of this paper will be devoted to describing the latent trait models which address the second of the problems mentioned earlier, the absence of a sound theoretical base for turning the response data into a measure.

2. Latent Trait Models

Three new latent trait models will be described. They differ from one another in their complexity, though each is designed to yield a single parameter to measure student achievement.

The simplest, "partial credit" model has a single difficulty parameter for each item. It is the latent trait analogue of Brown's (1965) integer scoring scheme based on the number of attempts needed to reach the correct response. The scoring is from 1.0 for a correct response on the first attempt to 0.0 for failure in (m-1) attempts, where there are m alternatives presented for an item (see Figure 1). This model takes no account of the variations in distractor attractiveness from item to item, nor of which distractors were actually selected by the respondent.


The second latent trait model treats the test as a sequence of distinct steps, each of which has a difficulty parameter. A single five-way multiple choice item can be regarded as comprising four steps, with each successive step after the first being attempted if, and only if, the preceding one is failed. The scoring is 1/0 for each step, with steps not attempted being coded as incomplete data (Figure 2). This produces four difficulty parameters for each item, but a single and more precise ability estimate for the individual. The method does not assume that all the items have the same logical structure with regard to difficulty, but it takes no account of exactly which distractors are selected.

The third model is an extension of the second. In this model, the step difficulty values for an item vary in terms of which distractors were previously selected. Thus for a five-way multiple choice item there is one difficulty parameter at the first step, four at the second, six at the third, and four at the fourth. This gives a total of fifteen difficulty parameters for a single five-way multiple choice item. It should in general give a better fit than the model described above because it treats the distractors individually, but it requires more data for the necessary calibration of the item parameters.

To some extent, the utility of these models is going to depend on the relative preponderance of the two styles of student behavior discussed earlier. Under partial knowledge, distractor elimination and random guessing (style A), the noise introduced by guessing precludes the possibility of very precise measurement, and the first model described may well prove as effective as either of the others. Where item responses based on correct information or misinformation (style B) dominate, we would expect that models two and three would provide more precise and valid measures of student performance.

Each of the models described is based on the simple one-parameter Rasch logistic model. This is for two reasons. Firstly, the Rasch model seems the logical choice in a situation which involves the construction of new test instruments, since it focuses attention on meeting the logical requirements for objective measurement. Secondly, the main alternative, the three-parameter logistic model, has severe practical limitations even when applied to regular test data. Estimating techniques are primitive, and very large samples are required in order to obtain stable parameter estimates. The three-parameter model has been found useful in describing large bodies of existing data derived from tests of varied quality, but such data sets do not exist in the AUC format. Since obtaining sufficient data for adequate item calibration is anticipated to be a problem even for the Rasch model, it appeared sensible to concentrate initial efforts in this direction.

Model (i): Fixed Partial Credit

The model is

$$E(X_{vi}) = \frac{e^{(\alpha_v - \delta_i)}}{1 + e^{(\alpha_v - \delta_i)}}$$

where: $E(X_{vi})$ is the expected score of person v on item i
       $\alpha_v$ is a parameter describing the ability of person v
       $\delta_i$ is a parameter describing the difficulty of item i

and the scoring function is

$$X_{vi} = 1 - \frac{g_{vi} - 1}{m_i - 1}$$

where $m_i$ is the number of alternative choices on item i (of which 1 is correct and $(m_i - 1)$ are incorrect)
and $g_{vi}$ is the number of attempts by person v on item i until the correct alternative is chosen. If the $(m_i - 1)$th attempt fails then $X_{vi} = 0$.


The rationale for this scoring scheme is based on a "partial knowledge" distractor elimination model. If a correct response is chosen at the first attempt, then it is assumed that the student was able to eliminate all the distractors, and so he or she gets full credit. If the first attempt fails, but the second attempt is correct, it is assumed that he or she could eliminate all the distractors but one, so that credit of $\frac{m-2}{m-1}$ is awarded. (The number of distractors is (m-1).)
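
The scoring rule can be illustrated with a short Python sketch. It assumes the equal-interval form X = (m - g) / (m - 1), which matches the credits quoted above (1.0 at the first attempt, (m-2)/(m-1) at the second, 0.0 when every distractor has been chosen). The function names are hypothetical and are not taken from the paper's own programs.

```python
import math


def partial_credit_score(attempts: int, n_alternatives: int) -> float:
    """Equal-interval credit X = (m - g) / (m - 1).

    attempts (g) counts choices up to and including the correct one, so
    g = m means every distractor was chosen first and the credit is 0.0.
    """
    if not 1 <= attempts <= n_alternatives:
        raise ValueError("attempts must lie between 1 and n_alternatives")
    return (n_alternatives - attempts) / (n_alternatives - 1)


def expected_score(ability: float, difficulty: float) -> float:
    """Rasch-type expected score e^(a - d) / (1 + e^(a - d))."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Credits for a five-way item, from first-attempt success to complete failure.
scores = [partial_credit_score(g, 5) for g in range(1, 6)]
print(scores)
```

For a five-way item the credits fall in equal steps of 0.25, which is the equidistance property the model relies on.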

Although this equal-interval scoring function may appear somewhat arbitrary, it is analogous to that frequently adopted in elementary scaling techniques (e.g., Likert scales). Moreover, Andersen (1977) has shown that for the model to retain specific objectivity, successive scoring categories must be equidistant. The immediate advantage of this is that the "raw score" obtained by a student who has worked through the set of items is a sufficient statistic for the ability (and frequently may be used instead of it; hence the viability of the scheme proposed by Brown).


Parameter estimation is approached via a modification of the Rasch PAIR estimation algorithm (Choppin, 1982). For two items i and j, the relative difficulty can be estimated by

$$\delta_i - \delta_j = \log b_{ji} - \log b_{ij}$$

where, on this occasion, $b_{ij}$ is the sum over all people in the sample of $X_i(1 - X_j)$, and $b_{ji}$ is similarly defined. (It can be seen that this reduces to the standard PAIR algorithm in the case of 1/0 scoring.) $X_i(1 - X_j)$ represents the product of an estimate of the extent to which item i is mastered multiplied by an estimate of the extent to which item j is not mastered. It may be viewed, for each subject, as a measure of the extent to which item i is easier than item j. The ratio

$$\frac{E[X_i(1 - X_j)]}{E[X_j(1 - X_i)]} = e^{(\delta_j - \delta_i)}$$

is a value independent of $\alpha$, which is why the accumulation of data over persons to estimate these expectations works.
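
As a minimal sketch of the pairwise estimate just described, the following Python code accumulates the sums of X_i(1 - X_j) over persons and takes the difference of logarithms. The score data are invented for illustration, and the function names are hypothetical; this is not the FORTRAN implementation referred to later in the paper.

```python
import math


def pair_counts(scores, i, j):
    """Return (b_ij, b_ji): sums over persons of X_i(1 - X_j) and X_j(1 - X_i)."""
    b_ij = sum(x[i] * (1.0 - x[j]) for x in scores)
    b_ji = sum(x[j] * (1.0 - x[i]) for x in scores)
    return b_ij, b_ji


def relative_difficulty(scores, i, j):
    """Estimate delta_i - delta_j as log(b_ji) - log(b_ij)."""
    b_ij, b_ji = pair_counts(scores, i, j)
    if b_ij <= 0 or b_ji <= 0:
        raise ValueError("both sums must be positive to take logarithms")
    return math.log(b_ji) - math.log(b_ij)


# Invented partial-credit scores for four persons on two items.
data = [(1.0, 0.5), (0.75, 0.25), (0.5, 0.5), (1.0, 0.0)]
estimate = relative_difficulty(data, 0, 1)
print(estimate)  # negative: item 0 appears easier than item 1
```

With 1/0 scores the two sums become simple counts of persons who passed one item and failed the other, which is the standard PAIR case.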

The algebra for maximum likelihood estimation, and for controlling the model via the squared matrix B*, exactly duplicates that laid out in Choppin (1982), except that the formulae presented there for the standard errors of the $\delta$-values are no longer appropriate. (Corrected formulae have not yet been developed, so the values reported by PAIR are used as conservative guides.) Once the items are calibrated, the estimation of person ability again follows the PAIR procedure.

Model (ii): Step Calibration

In this model, the probability of person v responding correctly to item i at the gth attempt, given that he or she makes the attempt, is

$$\mathrm{Prob}[X_{vig} = 1] = \frac{e^{(\alpha_v - \delta_{ig})}}{1 + e^{(\alpha_v - \delta_{ig})}}$$

where $X_{vig}$ = 1 if the gth attempt at item i is successful, and 0 otherwise
$\alpha_v$ is again a parameter describing the ability of person v
and $\delta_{ig}$ is a parameter describing the difficulty of the gth step on item i.

For a five-way multiple choice item there are five possible sets of observation vectors X, with asterisks indicating missing data (i.e., attempts that do not occur).

                              X1   X2   X3   X4
Correct at first attempt:      1    *    *    *
Correct at second attempt:     0    1    *    *
Correct at third attempt:      0    0    1    *
Correct at fourth attempt:     0    0    0    1
Failure at fourth attempt:     0    0    0    0


If the raw data to be analyzed consist of code numbers for the successful attempt on each item, then they must be transformed into the above format for the calibration analysis. For example, suppose that an individual required (2, 1, 1, 4, 5, 3) attempts to find the correct answers to a six item five-way multiple choice test. The recoding of this vector would yield:

    0 1 * *
    1 * * *
    1 * * *
    0 0 0 1
    0 0 0 0
    0 0 1 *

a vector of 24 elements. A set of such vectors from the different persons attempting the test can be analyzed almost as a standard Rasch model problem, providing the PAIR algorithm (Choppin, 1982) is used to allow for the embedded missing data. The deviation from the standard Rasch procedure is necessitated by the violation of the local independence assumption for AUC data. While it remains important that between items this independence is maintained, it is clear that within an item the different X-values cannot be independent. As shown above, only m possible patterns out of the $3^m$ theoretically possible on each item ever occur, and certain combinations such as (1,0) are impossible.
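
The recoding of attempt counts into step vectors can be sketched in Python as follows. The attempt counts used here are illustrative, an attempt count of m is taken to mean failure at the last scored step, and the function name and the use of "*" for missing data are assumptions made for this example.

```python
MISSING = "*"


def recode_attempts(attempts: int, n_alternatives: int) -> list:
    """Expand the attempt count g for one item into its m - 1 step codes.

    Steps before the successful attempt are coded 0, the successful step 1,
    and steps never reached are coded as missing.
    """
    if not 1 <= attempts <= n_alternatives:
        raise ValueError("attempts must lie between 1 and n_alternatives")
    codes = []
    for step in range(1, n_alternatives):
        if step < attempts:
            codes.append(0)
        elif step == attempts:
            codes.append(1)
        else:
            codes.append(MISSING)
    return codes


# Illustrative attempt counts for a six item, five-way test.
attempt_counts = [2, 1, 1, 4, 5, 3]
vector = [code for g in attempt_counts for code in recode_attempts(g, 5)]
print(len(vector))  # 6 items x 4 steps
```

Every item contributes a run of zeros, at most one 1, and then missing codes, which is the within-item dependence discussed above.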



This invalidates the maximum likelihood estimation procedure, which assumes that the elements of the B matrix for item pairs are essentially independent.

The full theoretical implications of this are still being explored, but a convenient "fix" in order to calibrate the items is to use, instead of ML, a least squares procedure based on a modified B* matrix. This B*, instead of being simply the square of matrix B as before, is now screened to remove the contaminating dependence within items.


In the standard PAIR algorithm

$$b^*_{ij} = \sum_k b_{ik} b_{kj}$$

and since $b_{ii} = b_{jj} = 0$, $b^*_{ij}$ is independent of $b_{ij}$.

In PAIR as modified for AUC tests

$$b^*_{ij} = \sum_k b_{ik} v_{ik} v_{kj} b_{kj}$$

where $v_{ik}$ are the elements of a screening matrix such that
$v_{pq}$ = 0 if responses p and q relate to the same item
and $v_{pq}$ = 1 otherwise.

A least squares estimation procedure applied to the B* matrix yields calibrations for the $\delta_{ig}$ values (i = 1, ..., k; g = 1, ..., m-1).
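
The screening step can be sketched in plain Python, assuming the screened matrix has the form b*_ij = sum over k of b_ik v_ik v_kj b_kj, with v_pq = 0 when steps p and q belong to the same item and 1 otherwise. The matrix values and the `item_of` lookup are invented for illustration.

```python
def screened_square(b, item_of):
    """Square the pairwise matrix b while dropping same-item products.

    b[p][q] is the pairwise count for steps p and q; item_of[p] gives the
    item that step p belongs to.
    """
    n = len(b)

    def v(p, q):
        # Screening element: 0 within an item, 1 between items.
        return 0 if item_of[p] == item_of[q] else 1

    return [
        [sum(b[i][k] * v(i, k) * v(k, j) * b[k][j] for k in range(n))
         for j in range(n)]
        for i in range(n)
    ]


# Four steps: steps 0 and 1 belong to item 0, steps 2 and 3 to items 1 and 2.
item_of = [0, 0, 1, 2]
b = [
    [0, 3, 2, 4],
    [1, 0, 5, 2],
    [3, 1, 0, 6],
    [2, 4, 1, 0],
]
b_star = screened_square(b, item_of)
print(b_star[0][2])
```

In this toy case the unscreened square would give 19 for element (0, 2); screening removes the contribution routed through step 1, which shares an item with step 0, leaving 4.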

The estimation of person ability, the usual goal in such exercises, is somewhat different than in the standard Rasch model. Apart from rare failures at the final attempt, each student will score one point on each item and thus will have a raw score of k.

However, this raw score will be based on different numbers of "attempts", and individual step difficulties will be higher on some items than on others. Therefore $\alpha_v$ is estimated by the solution of

$$r_v = \sum \frac{e^{(\alpha_v - \delta_{ig})}}{1 + e^{(\alpha_v - \delta_{ig})}}$$

where the summation extends over the item steps actually attempted, and $r_v$ is the observed raw score (usually k). This equation can always be solved to produce a unique LS estimate of $\alpha_v$, but may be inefficient since its (iterative) solution is required for each observed score pattern. Monte Carlo simulation could compare the variation in $\alpha$ with the scoring function proposed by Brown (1965), to see whether the exact iterative solution is worthwhile.
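
A minimal sketch of the iterative solution follows. It assumes that the estimating equation sets the observed raw score equal to the sum of the model probabilities over the steps actually attempted, and it uses simple bisection rather than whatever iteration the original programs used. The step difficulties and the search bracket are invented for illustration.

```python
import math


def expected_raw_score(ability, step_difficulties):
    """Sum of Rasch success probabilities over the attempted steps."""
    return sum(1.0 / (1.0 + math.exp(-(ability - d))) for d in step_difficulties)


def estimate_ability(raw_score, step_difficulties, lo=-10.0, hi=10.0, tol=1e-8):
    """Solve expected_raw_score(ability) = raw_score by bisection.

    The bracket [lo, hi] is an assumption; an error is raised if the
    solution does not lie inside it.
    """
    if not (expected_raw_score(lo, step_difficulties) < raw_score
            < expected_raw_score(hi, step_difficulties)):
        raise ValueError("no solution inside the search bracket")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        # The expected score increases with ability, so bisection applies.
        if expected_raw_score(mid, step_difficulties) < raw_score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Invented difficulties for six attempted steps, three of them passed.
steps = [-1.0, 0.0, 0.5, 1.0, -0.5, 0.2]
ability = estimate_ability(3, steps)
print(round(ability, 3))
```

Because the set of attempted steps differs from person to person, two students with the same raw score can receive different ability estimates, which is the point made in the paragraph above.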

The standard errors of such estimates depend upon the number of attempts made. Thus someone who usually responds correctly at the first attempt will be measured with less precision than someone who typically requires two or three attempts. Data in which the mean number of attempts per item is 2.0 (a typical value) will yield standard errors of measurement only 0.7 times as large as with a conventional test with the same number of items. From this it can be seen that major increases in precision can only be achieved by substantially increasing the number of alternatives per question, so that the number of attempts made before success will also increase.

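
The factor of 0.7 is consistent with a simple approximation that the paper does not spell out: if each attempted step contributes roughly as much information as one conventional item, the standard error shrinks with the square root of the mean number of attempts. The Python sketch below rests entirely on that assumption.

```python
import math


def se_ratio(mean_attempts_per_item: float) -> float:
    """Approximate AUC standard error relative to a conventional test.

    Assumes information is proportional to the number of scored steps,
    so the ratio is 1 / sqrt(mean attempts per item).
    """
    if mean_attempts_per_item < 1.0:
        raise ValueError("at least one attempt per item is required")
    return 1.0 / math.sqrt(mean_attempts_per_item)


ratio = se_ratio(2.0)
print(round(ratio, 2))
```

Under the same assumption, halving the standard error would need a mean of about four attempts per item, which is why more alternatives per question are required for large gains.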
Model (iii): Distractor Calibration

This model is an extension of (ii) to allow for differences among the distractors. The item step difficulty parameter now describes the difficulty of the item at each step, taking account of which distractors have already been eliminated.

Thus $\delta_{i1}$ indicates the difficulty of item i at the initial step when all distractors are present
$\delta_{i2.A}$ indicates the difficulty of item i at the second step when distractor A was chosen at the first
$\delta_{i3.BC}$ indicates the difficulty of item i at the third step after distractors B and C have been chosen (in whatever order)

With this notation, the model becomes

$$\mathrm{Prob}[X_{vig.F} = 1] = \frac{e^{(\alpha_v - \delta_{ig.F})}}{1 + e^{(\alpha_v - \delta_{ig.F})}}$$

where F denotes the set of distractors already chosen.




The analysis and estimation procedures essentially follow those for model (ii) except that the response data must be coded in a different format. For a five-way item (for which the correct response is E, and the distractors are labeled A-D), the structure of the parameters to be estimated is:

Step 1:  $\delta_{i1}$
Step 2:  $\delta_{i2.A}$   $\delta_{i2.B}$   $\delta_{i2.C}$   $\delta_{i2.D}$
Step 3:  $\delta_{i3.AB}$   $\delta_{i3.AC}$   $\delta_{i3.AD}$   $\delta_{i3.BC}$   $\delta_{i3.BD}$   $\delta_{i3.CD}$
Step 4:  $\delta_{i4.ABC}$   $\delta_{i4.ABD}$   $\delta_{i4.ACD}$   $\delta_{i4.BCD}$



Response data for an individual who chose responses A, C, E, in that order, getting the item right at the third attempt, would be coded

Step 1:  0
Step 2:  0 * * *
Step 3:  * 1 * * * *
Step 4:  * * * *


It should be noted that this coding scheme is severely constrained. There is at most one entry in each block, and a "1" entry effectively terminates the vector. Thus the range of possible response patterns is limited, and again the local independence principle is violated.
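
The coding scheme can be sketched in Python. The sketch labels each parameter by its step and by the set of distractors already chosen, orders the labels alphabetically within each block, and uses "*" for missing entries; these conventions and the function names are assumptions made for illustration.

```python
from itertools import combinations

MISSING = "*"


def parameter_labels(distractors):
    """List (step, chosen-set) labels; step 1 has an empty chosen set."""
    labels = []
    for n_chosen in range(len(distractors)):
        for chosen in combinations(sorted(distractors), n_chosen):
            labels.append((n_chosen + 1, "".join(chosen)))
    return labels


def code_response(sequence, correct, distractors):
    """Code one item's response sequence against the parameter labels."""
    labels = parameter_labels(distractors)
    coded = {label: MISSING for label in labels}
    chosen = []
    for step, choice in enumerate(sequence, start=1):
        if step > len(distractors):
            break  # the final choice is forced and is not scored
        coded[(step, "".join(sorted(chosen)))] = 1 if choice == correct else 0
        if choice == correct:
            break  # a "1" entry terminates the vector
        chosen.append(choice)
    return [coded[label] for label in labels]


# Responses A, C, E in that order on a five-way item whose key is E.
vector = code_response("ACE", correct="E", distractors="ABCD")
print(vector)
```

The coded vector has one entry in each of the first three blocks and none in the fourth, in line with the constraint that at most one entry appears per block.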

Estimation procedures can follow the sequence described in model (ii), first to calibrate the item step values, and secondly to estimate the person ability parameters. However, it is apparent that the procedure is somewhat unwieldy. For each item the number of difficulty parameters to be estimated is given by $(2^{m-1} - 1)$, where m is the number of alternative responses in the item format. Inadequate calibration of the parameters due to insufficient data can spoil the overall measurement of person ability (viz: person measurement with the Lord-Birnbaum three-parameter model and small data sets). A six item five-way multiple choice test such as that described under model (ii) would require the estimation of 90 item difficulty parameters under model (iii) as opposed to 24 under model (ii). For this model, in contrast to model (ii), it would seem wise to restrict item formats to not more than three or four alternatives.

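
The parameter counts quoted above can be checked with a few lines of Python, using (m - 1) step parameters per item for model (ii) and (2^(m-1) - 1) for model (iii). The function names are hypothetical.

```python
def step_parameters(n_alternatives: int) -> int:
    """Difficulty parameters per item under model (ii): one per step."""
    return n_alternatives - 1


def distractor_parameters(n_alternatives: int) -> int:
    """Difficulty parameters per item under model (iii): 2^(m-1) - 1."""
    return 2 ** (n_alternatives - 1) - 1


n_items = 6
for m in (3, 4, 5):
    print(m, n_items * step_parameters(m), n_items * distractor_parameters(m))
```

For six items the model (iii) count is 18 with three alternatives and 42 with four, against 90 with five, which illustrates why the shorter formats are recommended for this model.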
3. Trial Data Analysis 

Calibration procedures for models (i) and (ii) have been programmed in FORTRAN using variations of the PAIR algorithm described above. Both programs have demonstrated their ability to recover the parameter values used to generate artificial "fitting" data. Two data sets from AUC tests, each comprising several hundred cases, have been analyzed using these programs. One test is a junior high school science test under development in England. The second is a college level psychology test used in a private California university. The results are still being studied.

Model (iii) requires the coding of which distractors were selected in which sequence, and this is only practicable with a clinically administered or computer administered test. For this reason we have devoted considerable time to developing a software package that will administer AUC tests in schools, and store the results in a format suitable for aggregation and subsequent analysis. Details of this package are given in the Appendix.